About the project

https://github.com/hbhjj/IODS-project

Open Data Science

The aim of this course is to get familiarized with open data science. This includes open data and open source tools.


Week 2 - Regression analysis

This week I studied regression analysis with R in the IODS course. Regression analysis is a method for estimating relationships between variables. In this exercise, we will look at how learning strategies and attitude are connected to the exam points of one statistics course.

A more detailed description of the dataset can be found at http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS2-meta.txt

Let's read the data from the disk

learning2014 <- read.table(file = 'data/learning2014.txt', header = TRUE)

Dimensions and structure of the data

dim(learning2014)
## [1] 166   7

There are 166 observations and 7 variables in the data

str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.75 2.88 3.88 3.5 3.75 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...

The 7 variables in the dataframe are gender, Age, attitude, deep, stra, surf and Points

Let's check the summary and visualize the data

If not already installed, you should run install.packages(c("ggplot2", "GGally")) to install the packages required for data visualization

The summaries of the variables

summary(learning2014)
##  gender       Age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.625   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.500   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.875   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.796   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.250   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.875   Max.   :5.000  
##       surf           Points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

Access the GGally and ggplot2 libraries

library(GGally)
library(ggplot2)

Draw a scatter plot matrix of the variables in learning2014; [-1] excludes the first column (gender)

pairs(learning2014[-1], col=learning2014$gender)

create a more advanced plot matrix with ggpairs()

p <- ggpairs(learning2014, mapping = aes(col=gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

draw the plot

p

It looks like Points correlates most heavily with attitude, stra and surf. We will use these variables later in our regression model. Students tend to be fairly young and there are more females than males, which is not that surprising - these are university students, after all. The deep learning variable peaks at the higher scores.
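To back these visual impressions with numbers, the pairwise correlations can also be computed with cor(). A minimal sketch, shown here with the built-in mtcars data so it runs on its own; for the course data one would pass learning2014[-1] instead:

```r
# correlation matrix of a few numeric variables, rounded for readability
round(cor(mtcars[, c("wt", "hp", "mpg")]), 2)
```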

Next, we should build a model with the variables described above.

Let's create a regression model with attitude, stra and surf as explanatory variables and Points as the variable to be explained, and print a summary of it

r_model <- lm(Points ~ attitude + stra + surf, data = learning2014)
summary(r_model)
## 
## Call:
## lm(formula = Points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

stra and surf did not have a statistically significant connection to Points; attitude did. According to the adjusted R-squared value, the model explains around 19 percent of the variability of Points. A 1 point increase in attitude implies a 3.4 point increase in Points, if one believes the model. The intercept is around 11, so if all explanatory variables were 0, Points should be around 11.
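The reading "a 1 point increase in attitude implies a 3.4 point increase in Points" can be checked mechanically: in a linear model, predictions one unit apart differ by exactly the estimated slope. A self-contained sketch with synthetic data (the names toy_model, x and y are made up for illustration):

```r
set.seed(1)
x <- runif(100, 1, 5)
y <- 10 + 3 * x + rnorm(100)   # true intercept 10, true slope 3
toy_model <- lm(y ~ x)

p1 <- predict(toy_model, newdata = data.frame(x = 3))
p2 <- predict(toy_model, newdata = data.frame(x = 4))
# the difference between predictions one unit apart is the slope estimate
unname(p2 - p1)   # equals coef(toy_model)["x"]
```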

As a final touch, let's do a graphical analysis of the results by plotting Residuals vs Fitted values, a Normal Q-Q plot and Residuals vs Leverage

par(mfrow=c(2,2))
plot(r_model, which= c(1,2,5))

The Residuals vs Fitted plot seems fairly random, so the errors should not depend on the explanatory variables. The Q-Q plot suggests that the errors of the model are fairly normally distributed; there is some deviation at both ends, but it does not seem dramatic. The Residuals vs Leverage plot shows some outliers, but their leverage is not drastically high compared to the other data points.


Week 3 - Logistic regression

This week, we will explore student alcohol usage. The dataset has, for example, questions related to alcohol usage, social relationships, health and studying. A description of the data can be found at https://archive.ics.uci.edu/ml/datasets/STUDENT+ALCOHOL+CONSUMPTION

load dplyr, GGally, tidyr and ggplot2

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
library(tidyr)

read the data from data folder

alc <- read.table(file = 'data/alc.txt', header = TRUE)

glimpse the data

glimpse(alc)
## Observations: 382
## Variables: 35
## $ school     <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex        <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address    <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize    <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus    <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob       <fctr> at_home, at_home, at_home, health, other, services...
## $ Fjob       <fctr> teacher, other, other, services, other, other, oth...
## $ reason     <fctr> course, course, other, home, home, reputation, hom...
## $ nursery    <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet   <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ guardian   <fctr> mother, father, mother, mother, father, mother, mo...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup  <fctr> yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup     <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid       <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ higher     <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic   <fctr> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...

Data has 382 observations and 35 variables.

I chose to see whether the quality of family relationships, family support for education, going out and current health status affect heavy drinking. The rationale behind the first two variables is that it seems plausible that social support makes heavy drinking less probable. Going out with friends offers opportunities to drink, and there might be social pressure to do so. Overall bad health may lead to drinking. Out of the four variables, family support is binary; family relationships, going out and health are measured on a 5-point Likert scale.

Now, let's make a subset of the data that only has the variables we are interested in and draw bar plots of them.

alc_sub <- alc[,c("famsup","famrel","goout","health", "high_use")]

# use gather() to gather columns into key-value pairs and then glimpse() at the resulting data
gather(alc_sub) %>% glimpse
## Warning: attributes are not identical across measure variables; they will
## be dropped
## Observations: 1,910
## Variables: 2
## $ key   <chr> "famsup", "famsup", "famsup", "famsup", "famsup", "famsu...
## $ value <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "ye...
# draw a bar plot of each variable
gather(alc_sub) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables; they will
## be dropped

Participants tend to have fairly good family relationships. Most have family support for their studies. Health is skewed towards people feeling very healthy: it is not normally distributed. There are fewer people in the high alcohol usage category. Going out is fairly normally distributed.

Let's explore the chosen variables a bit

# initialize a plot of high_use and family relations
g1 <- ggplot(alc, aes(x = high_use, y = famrel))


# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("Family relationship")

Good family relationships seem to be connected to not being a high user of alcohol, as the hypothesis predicted

# initialise a plot of high_use and health
g2 <- ggplot(alc, aes(x = high_use, y = health))

# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ylab("Health")

Maybe a bit surprisingly, health seems to be similar for both high users and low users. This might be because participants feel quite healthy overall.

Then, going out and drinking

# initialise a plot of high_use and going out
g2 <- ggplot(alc, aes(x = high_use, y = goout))

# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ylab("Going out")

Those going out with friends more tend to drink more. Not that surprising.

Then, let's see how family support and alcohol usage compare. Let's make a crosstab out of them

#create crosstab from high use and famsup

xtabs(~alc$high_use+alc$famsup)
##             alc$famsup
## alc$high_use  no yes
##        FALSE  98 170
##        TRUE   46  68

Overall, people who do not use a lot of alcohol and have family support are the biggest group. Interestingly, the proportion of those without family support is larger in the high use group than in the non-high use group.

Building the regression model

We will use logistic regression as the method of choice. Most often this method is used to predict binary variables, such as our high_use. Now we will fit the variables chosen above into a regression model and print out a summary of it.

alcm <- glm(high_use ~ famrel + famsup + goout + health, data = alc, family = "binomial")
summary(alcm)
## 
## Call:
## glm(formula = high_use ~ famrel + famsup + goout + health, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5527  -0.8023  -0.5594   0.9769   2.4393  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.42198    0.69344  -3.493 0.000478 ***
## famrel      -0.40343    0.13648  -2.956 0.003117 ** 
## famsupyes   -0.20366    0.25136  -0.810 0.417821    
## goout        0.80717    0.11913   6.776 1.24e-11 ***
## health       0.16986    0.09045   1.878 0.060383 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 404.08  on 377  degrees of freedom
## AIC: 414.08
## 
## Number of Fisher Scoring iterations: 4

Okay, so only family relationship and going out have significance. Let's drop everything else, fit the model again with only these two as predictors, and print the coefficients.

alcm2 <- glm(high_use ~ famrel + goout, data = alc, family = "binomial")
summary(alcm2)
## 
## Call:
## glm(formula = high_use ~ famrel + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5448  -0.7518  -0.5252   0.9872   2.3530  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.0289     0.6148  -3.300 0.000966 ***
## famrel       -0.3667     0.1338  -2.740 0.006142 ** 
## goout         0.7922     0.1183   6.698 2.11e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 408.12  on 379  degrees of freedom
## AIC: 414.12
## 
## Number of Fisher Scoring iterations: 4
coef(alcm2)
## (Intercept)      famrel       goout 
##  -2.0288785  -0.3666676   0.7921642

As family relationships get poorer, the likelihood of using large amounts of alcohol increases. As a person goes out more, the likelihood of being a high user of alcohol increases.

Next, let's see the confidence intervals for the coefficients as odds ratios.

OR <- coef(alcm2) %>% exp
CI <- exp(confint(alcm2))
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.1314829 0.03816207 0.4283774
## famrel      0.6930400 0.53161279 0.8999454
## goout       2.2081701 1.76229526 2.8045361

Going out seems to increase the tendency for high alcohol consumption quite heavily. As the odds ratio of famrel is less than one, it is negatively associated with high alcohol consumption: better family relationships imply a tendency to drink less. Neither of the confidence intervals includes 1, which would mean equal odds and thus indicate an unreliable predictor.
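The odds ratio reading can be sanity-checked on synthetic data: exponentiating the model's linear predictor gives odds, and the odds at x + 1 divided by the odds at x equal the exponentiated coefficient. A sketch (the names m, x and y here are made up for illustration):

```r
set.seed(2)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(-1 + 0.8 * x))   # true log-odds slope 0.8
m <- glm(y ~ x, family = "binomial")

odds_at_0 <- exp(predict(m, data.frame(x = 0)))   # predict() returns log-odds
odds_at_1 <- exp(predict(m, data.frame(x = 1)))
unname(odds_at_1 / odds_at_0)   # equals exp(coef(m)["x"]), the odds ratio
```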

How well does the model work in prediction? Maybe the following section can answer that.

First, the confusion matrix.

probabilities <- predict(alcm2, type = "response")

alc <- mutate(alc, probability = probabilities)

alc <- mutate(alc, prediction = probability > 0.5)

table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   243   25
##    TRUE     68   46

The model looks surprisingly good. Let's plot it.

logmodg <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
logmodg + geom_point()

Okay, how good is the model, actually? A loss function might tell us. First, though, let's see a table of the prediction proportions

table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.63612565 0.06544503 0.70157068
##    TRUE  0.17801047 0.12041885 0.29842932
##    Sum   0.81413613 0.18586387 1.00000000

So the model predicted FALSE when the truth actually was FALSE 64% of the time, and TRUE when it actually was TRUE 12% of the time.
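From the counts in the confusion matrix above, a few standard summary rates can be computed directly (a sketch with the numbers copied in by hand):

```r
cm <- matrix(c(243, 25,
                68, 46),
             nrow = 2, byrow = TRUE,
             dimnames = list(high_use = c("FALSE", "TRUE"),
                             prediction = c("FALSE", "TRUE")))
accuracy    <- sum(diag(cm)) / sum(cm)                    # share of correct predictions
sensitivity <- cm["TRUE", "TRUE"] / sum(cm["TRUE", ])     # share of TRUEs caught
specificity <- cm["FALSE", "FALSE"] / sum(cm["FALSE", ])  # share of FALSEs caught
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

Note that 1 - accuracy here is about 24%, matching the loss function result further below.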

# the logistic regression model m and dataset alc with predictions are available

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2434555

So overall, the model is wrong in 24% of its guesses. This is actually not that bad.

How would this model compare to one where every prediction is the most common category, FALSE? Let's find out by creating a new variable that only has FALSE guesses and using the loss function on it.

# every guess is FALSE, regardless of the model's prediction
alc$all_false <- FALSE

loss_func(class = alc$high_use, prob = alc$all_false)
## [1] 0.2984293

It would seem that the model is better than just guessing FALSE all the time, but one can get pretty good results with that guess too.

Bonus: Cross validation

library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = alcm2, K = 10)
cv$delta[1]
## [1] 0.2617801

Using 10-fold cross-validation, my model seems fairly similar to the DataCamp model, which had a misclassification rate of around 26%. It might be that going out is just too good a predictor: for a student population, going out with friends may simply mean drinking with friends, so it is not the most informative predictor.


Week 4: Clustering

This week we are going to do clustering.

Load MASS, tidyr, dplyr, corrplot, reshape2 and ggplot2 packages

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(tidyr)
library(ggplot2)
library(dplyr)
library(corrplot)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths

Load the Boston data, print out structure, summary, matrix plot and correlation plot of the variables.

# load the data
data("Boston")

# explore the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# plot matrix of the variables
d <- melt(Boston)
## No id variables; using all as measure variables
ggplot(d,aes(x = value)) + 
    facet_wrap(~variable,scales = "free") + 
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

pairs(Boston)

# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)

# print the correlation matrix
corrplot(cor_matrix, type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

# visualize the correlation matrix
corrplot(cor_matrix, method="circle")

The Boston data has 506 rows and 14 columns. The data is about housing values in Boston and contains variables that may explain some of the variance in the housing values, such as distances to five Boston employment centres and the pupil-teacher ratio of the town. Many variables have somewhat strange distributions: huge spikes and otherwise fairly low counts.

As we will be dealing with the crime rate variable, let's discuss it a bit. The crime rate is clearly very high in some areas; crime is not distributed evenly across the city.

It is not surprising that business areas have higher amounts of pollution. The crime rate is negatively correlated with, at least, distance to the employment centres, and positively with access to radial highways. Also, richer areas, where the valuable homes are, seem to have less crime.

Explanations of variables, as they are not named intuitively:

crim = per capita crime rate by town
zn = proportion of residential land zoned for lots over 25,000 sq.ft.
indus = proportion of non-retail business acres per town
chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox = nitrogen oxides concentration (parts per 10 million)
rm = average number of rooms per dwelling
age = proportion of owner-occupied units built prior to 1940
dis = weighted mean of distances to five Boston employment centres
rad = index of accessibility to radial highways
tax = full-value property-tax rate per $10,000
ptratio = pupil-teacher ratio by town
black = 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
lstat = lower status of the population (percent)
medv = median value of owner-occupied homes in $1000s

Standardizing the data

As we are going to use LDA, which assumes that the variables are normally distributed and share the same covariance matrix, we standardize the data so that each variable has mean 0 and variance 1.

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
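As a sanity check, scale() applies the familiar standardization formula (x - mean(x)) / sd(x) to each column. A minimal self-contained sketch:

```r
v <- c(2, 4, 6, 8)
manual <- (v - mean(v)) / sd(v)
# scale() returns a matrix; flattened, it matches the manual computation
all.equal(as.vector(scale(v)), manual)
```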

Create a categorical variable out of crim and drop the original. LDA classifies observations into categories, so we will turn the crime rate into such a variable. At this point, I will create a copy of the standardized data, with the original crim, for the k-means clustering exercise and name it b_scaled2. This way there is no need to reload the data and redo the scaling later on.

# save the scaled crim as scaled_crim and create copy of scaled data
scaled_crim  <- boston_scaled$crim
b_scaled2 <- boston_scaled
# summary of the scaled_crim
summary(scaled_crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419400 -0.410600 -0.390300  0.000000  0.007389  9.924000
# create a quantile vector of crim and print it
bins <- quantile(scaled_crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, label=c("low","med_low","med_high","high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

Create training and test sets out of the scaled data. We will build the model with the training set and later evaluate it with the test set. The crime variable is removed from the test dataset.

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

Fit the linear discriminant analysis on the train set. Here I use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. After this, the results are plotted.

# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2500000 0.2524752 0.2500000 0.2475248 
## 
## Group means:
##                   zn      indus        chas        nox          rm
## low       1.01940295 -0.9138673 -0.07742312 -0.8884925  0.45594634
## med_low  -0.09083686 -0.2850728 -0.07933396 -0.5744729 -0.13823258
## med_high -0.37516530  0.1872654  0.15646403  0.3927664  0.06885046
## high     -0.48724019  1.0171519 -0.03610305  1.0473539 -0.36008855
##                 age        dis        rad        tax    ptratio
## low      -0.8673706  0.9018001 -0.7066939 -0.7366214 -0.4431954
## med_low  -0.3497999  0.3758294 -0.5382477 -0.4804428 -0.1230137
## med_high  0.4699678 -0.3601570 -0.4451619 -0.3353818 -0.2968492
## high      0.7985438 -0.8446323  1.6377820  1.5138081  0.7803736
##                black       lstat        medv
## low       0.37795309 -0.76314941  0.52059710
## med_low   0.31799448 -0.11020558 -0.00548579
## med_high  0.08986796  0.04686412  0.17632171
## high     -0.86973711  0.82762720 -0.67573513
## 
## Coefficients of linear discriminants:
##                 LD1          LD2          LD3
## zn       0.08836307  0.764886432 -1.028377738
## indus    0.01193862 -0.204711947  0.361078225
## chas    -0.02792670  0.009694380 -0.008519928
## nox      0.38251680 -0.691868835 -1.354136435
## rm       0.01551848 -0.007839186 -0.157240348
## age      0.18486839 -0.360002608 -0.264160777
## dis     -0.10817326 -0.336677179  0.173941126
## rad      3.33730890  0.911247133  0.140000558
## tax      0.04462571 -0.047383483  0.446678345
## ptratio  0.13911550  0.021466845 -0.473024578
## black   -0.11410749  0.029482365  0.153531795
## lstat    0.21839601 -0.255792454  0.341239812
## medv     0.11178389 -0.469169132 -0.230502339
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9502 0.0362 0.0137
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)

The plot shows that the index of accessibility to radial highways has the greatest impact on classifying high crime rate areas.

Now I will predict the classes with the LDA model on the test data and cross-tabulate the results with the crime categories from the test set. The categories were saved earlier in correct_classes.

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       12      13        1    0
##   med_low    4      13        7    0
##   med_high   0       5       18    2
##   high       0       0        0   27

The model is fairly accurate in its predictions. Noteworthy is that it did not make any mistakes with the high crime area predictions. It is more confused with the low and med_low categories.
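The overall accuracy can be read off the cross-tabulation by dividing the diagonal by the total (counts copied from the table above; the exact numbers vary with the random train/test split):

```r
# rows = correct classes, columns = predicted classes, as in the table above
conf <- matrix(c(12, 13,  1,  0,
                  4, 13,  7,  0,
                  0,  5, 18,  2,
                  0,  0,  0, 27),
               nrow = 4, byrow = TRUE)
sum(diag(conf)) / sum(conf)   # share of correctly classified test observations
```

For this particular split, about 69% of the test observations end up in the correct category.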

K-means clustering

Now I will do k-means clustering on the scaled b_scaled2 data that we saved earlier. I will calculate the distances between the observations and run the k-means algorithm on the dataset. After this, I will investigate the optimal number of clusters and run the algorithm again.

# euclidean distance matrix
dist_eu <- dist(b_scaled2)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4620  4.8240  4.9110  6.1860 14.4000
# manhattan distance matrix
dist_man <- dist(b_scaled2, method="manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4830 12.6100 13.5500 17.7600 48.8600
# k-means clustering
km <- kmeans(dist_eu, centers = 15)

# plot the scaled Boston dataset with clusters
pairs(b_scaled2, col = km$cluster)

# optimal number of clusters
set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})

# visualize the results
plot(1:k_max, twcss, type='b')

With 15 clusters, it is quite hard to make any sense out of the pairs plot.

After calculating the total within sum of squares and plotting it, the sharpest drop is between 1 and 2 clusters, so 2 is probably the optimal number of clusters.
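The "sharpest drop" rule can also be read off programmatically from the twcss vector. A minimal sketch with simulated two-cluster data (the objects x, drops and k_opt are illustrative, not part of the exercise):

```r
set.seed(123)
# simulated data with two well-separated clusters
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))

k_max <- 10
twcss <- sapply(1:k_max, function(k) kmeans(x, k, nstart = 10)$tot.withinss)

# the optimal k sits right after the largest drop in total WCSS
drops <- -diff(twcss)   # decrease gained by adding one more cluster
k_opt <- which.max(drops) + 1
```

Applied to the twcss vector computed above, the same diff trick points at 2 clusters.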

# k-means clustering
km <- kmeans(dist_eu, centers = 2)


# plot the scaled Boston dataset with clusters
pairs(b_scaled2, col = km$cluster)

The index of accessibility to radial highways seems to divide these clusters apart. Regarding the crime rate, one cluster contains only low crime rate areas, while the other contains both high and low. The proportion of non-retail businesses also seems to be a clear divider. It would be interesting to see these clusters plotted on an actual map.
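A quick way to check which variables drive the clusters, complementing the pairs plot, is to compare cluster-wise means of the scaled variables. A self-contained sketch using the built-in iris data standing in for b_scaled2:

```r
# scaled numeric data and a 2-cluster k-means fit (iris stands in for Boston)
d  <- scale(iris[, 1:4])
km <- kmeans(d, centers = 2, nstart = 10)

# cluster-wise means: variables with large between-cluster gaps are the dividers
round(aggregate(d, by = list(cluster = km$cluster), FUN = mean), 2)
```

For the Boston clusters, the same aggregate call would show the accessibility index and the non-retail proportion with the widest gaps between cluster means.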


Week 5: Dimensionality reduction

Load GGally, tidyr, dplyr, corrplot, reshape2 and ggplot2 packages

library(GGally)
library(tidyr)
library(ggplot2)
library(dplyr)
library(corrplot)
library(reshape2)

Load the data to R and inspect it a bit

human <- read.table(file = 'data/human.txt', header = T)
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155   8

The data has 155 observations on 8 variables. Each observation in this data represents a different country. Edu2.FM and Labo.FM are the female/male ratios of the population with at least secondary education and of labour force participation. Edu.Exp and Life.Exp are the expected length of education and life expectancy. GNI is short for Gross National Income. Mat.Mor is the maternal mortality ratio, Ado.Birth the adolescent birth rate and Parli.F the percentage of parliament seats held by women.

Graphical overview and summaries of the variables

# visualize the 'human' variables
ggpairs(human)

# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot()

summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Life expectancy and years spent in school correlate fairly heavily, and both also correlate with GNI. Larger proportions of educated females relative to males also correlate with these variables. Interestingly, the same is not true for the female/male labour ratio. I have worked with similar data before, and there a variable that measured females in non-agricultural jobs correlated with GNI. I suspect that females in agricultural jobs are the culprit here. Both ratios correlate negatively with maternal mortality. GNI, maternal mortality and adolescent birth rate all have right-skewed (positively skewed) distributions.
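These right-skewed variables (GNI in particular) are often log-transformed before modelling. A base-R sketch of a moment-based skewness check; the skewness helper and the simulated gni_like vector are illustrative, not part of the exercise:

```r
# moment-based sample skewness (positive = long right tail)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(1)
gni_like <- rlnorm(155, meanlog = 9, sdlog = 1)  # simulated right-skewed incomes

skewness(gni_like)       # clearly positive
skewness(log(gni_like))  # close to zero after the log transform
```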

Next, we will perform Principal Component Analysis (PCA) on the non-scaled data. PCA transforms the data into a smaller set of variables (principal components) that do not correlate with each other. It is used to bring out patterns in the data.

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
#see the summary of the model
s <- summary(pca_human)
s
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1) ,col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

Figure 1. A mess due to non-scaled data

Using the unscaled data does not bring out the variance explained by the components: PC1 alone accounts for practically all of it. The plot shows that this is most likely due to GNI, which has a vastly different scale from the other variables, so the variation in the other variables barely registers in the model.
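The scale effect is easy to reproduce: give one of two otherwise similar variables a huge variance and it soaks up essentially all of PC1. A self-contained sketch (the variable names are illustrative):

```r
set.seed(42)
# two equally informative variables, one measured on a vastly larger scale
small <- rnorm(100, sd = 1)
large <- rnorm(100, sd = 10000)   # plays the role of GNI

p <- prcomp(cbind(small, large))
summary(p)$importance["Proportion of Variance", ]
# PC1 carries essentially all of the variance purely because of the scale difference
```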

Do the same with standardized data, then.

# standardize the variables
human_std <- scale(human)
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)

# print out summaries of the standardized variables and the model
summary(human_std)
##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
s <- summary(pca_human)
s
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
pca_human
## Standard deviations:
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation:
##                   PC1         PC2         PC3         PC4        PC5
## Edu2.FM   -0.35664370  0.03796058 -0.24223089  0.62678110 -0.5983585
## Labo.FM    0.05457785  0.72432726 -0.58428770  0.06199424  0.2625067
## Edu.Exp   -0.42766720  0.13940571 -0.07340270 -0.07020294  0.1659678
## Life.Exp  -0.44372240 -0.02530473  0.10991305 -0.05834819  0.1628935
## GNI       -0.35048295  0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor    0.43697098  0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth  0.41126010  0.07708468  0.01968243  0.04986763 -0.4672068
## Parli.F   -0.08438558  0.65136866  0.72506309  0.01396293 -0.1523699
##                   PC6         PC7         PC8
## Edu2.FM    0.17713316  0.05773644  0.16459453
## Labo.FM   -0.03500707 -0.22729927 -0.07304568
## Edu.Exp   -0.38606919  0.77962966 -0.05415984
## Life.Exp  -0.42242796 -0.43406432  0.62737008
## GNI        0.11120305 -0.13711838 -0.16961173
## Mat.Mor    0.17370039  0.35380306  0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F    0.13749772  0.00568387 -0.02306476
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1) ,col = c("grey40", "deeppink2"))
Figure 2. National wellbeing and female participation dimensions

This is much better now: all the variables are on the same scale. The first dimension still explains most of the variance (around 54 percent), but not all of it. The female/male ratio of the workforce loads heavily on component 2, as does the percentage of females in the parliament. The plot confirms this. However, Parli.F seems to pull slightly towards the countries with high GNI and overall higher quality-of-life measures such as life expectancy, whilst the labour ratio pulls slightly towards qualities such as high maternal mortality.
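A common refinement is to label the biplot axes with the explained-variance percentages from the summary object. A self-contained sketch with the built-in USArrests data; the same recipe applies to pca_human and s above:

```r
# PCA on standardized data; USArrests stands in for the human dataset
p <- prcomp(scale(USArrests))

# percentage of variance explained, rounded, turned into axis labels
pve  <- round(100 * summary(p)$importance["Proportion of Variance", ], 1)
labs <- paste0(names(pve), " (", pve, "%)")

biplot(p, choices = 1:2, cex = c(0.8, 1),
       xlab = labs[1], ylab = labs[2])
```

This makes it immediately visible on the figure how much of the total variance the two plotted components capture.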

TeaTime-data and MCA

Now we will perform Multiple Correspondence Analysis (MCA), which can be used to detect patterns or structure in categorical data. It can also be used to reduce data dimensions. We will use a dataset containing answers to a questionnaire on tea consumption.

library(FactoMineR)
#load the tea data
data("tea")
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

I want to simplify the dataset a bit, so I will take the first 12 variables. These variables hold when- and where-type information, with one variable indicating whether tea is drunk with friends. It will be interesting to see whether this spatio-temporal mixture of data, with a hint of social factors, leads to some revelations.

library(reshape2)

tea_sub <- tea[,1:12]
d <- gather(tea_sub)
## Warning: attributes are not identical across measure variables; they will
## be dropped
ggplot(d,aes(value)) + 
    facet_wrap("key",scales = "free") + 
    geom_bar()

Most people do not enjoy tea during dinner, and tend to drink it at home. Drinking tea at breakfast divides the data fairly evenly. Lunch is not tea time for many. Tea is often enjoyed with friends, but not in a pub.

Let’s create the model.

#create mca model
mca <- MCA(tea_sub, graph = FALSE)

# print statistics of the model
summary(mca)
## 
## Call:
## MCA(X = tea_sub, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.170   0.110   0.099   0.090   0.088   0.080
## % of var.             17.038  11.006   9.935   9.027   8.833   7.963
## Cumulative % of var.  17.038  28.044  37.979  47.006  55.839  63.801
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.075   0.066   0.062   0.059   0.056   0.043
## % of var.              7.482   6.629   6.204   5.934   5.624   4.325
## Cumulative % of var.  71.284  77.913  84.117  90.051  95.675 100.000
## 
## Individuals (the 10 first)
##                  Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1             | -0.571  0.638  0.571 | -0.346  0.363  0.210 | -0.032
## 2             | -0.571  0.638  0.571 | -0.346  0.363  0.210 | -0.032
## 3             |  0.134  0.035  0.009 |  0.641  1.243  0.206 | -0.220
## 4             | -0.900  1.583  0.488 |  0.394  0.470  0.094 |  0.046
## 5             | -0.290  0.164  0.104 | -0.040  0.005  0.002 |  0.179
## 6             | -0.900  1.583  0.488 |  0.394  0.470  0.094 |  0.046
## 7             | -0.122  0.029  0.036 | -0.389  0.458  0.364 | -0.202
## 8             | -0.265  0.137  0.111 |  0.060  0.011  0.006 | -0.302
## 9             | -0.358  0.251  0.243 | -0.527  0.840  0.525 | -0.174
## 10            | -0.130  0.033  0.017 | -0.129  0.051  0.016 | -0.139
##                  ctr   cos2  
## 1              0.003  0.002 |
## 2              0.003  0.002 |
## 3              0.162  0.024 |
## 4              0.007  0.001 |
## 5              0.107  0.040 |
## 6              0.007  0.001 |
## 7              0.137  0.098 |
## 8              0.306  0.145 |
## 9              0.101  0.057 |
## 10             0.065  0.019 |
## 
## Categories (the 10 first)
##                   Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## breakfast     |   0.148   0.515   0.020   2.460 |  -0.752  20.547   0.522
## Not.breakfast |  -0.137   0.475   0.020  -2.460 |   0.694  18.966   0.522
## Not.tea time  |  -0.594   7.525   0.273  -9.037 |   0.405   5.420   0.127
## tea time      |   0.460   5.833   0.273   9.037 |  -0.314   4.201   0.127
## evening       |   0.491   4.053   0.126   6.142 |   0.584   8.876   0.179
## Not.evening   |  -0.257   2.119   0.126  -6.142 |  -0.306   4.641   0.179
## lunch         |   0.858   5.287   0.127   6.154 |   0.085   0.081   0.001
## Not.lunch     |  -0.148   0.909   0.127  -6.154 |  -0.015   0.014   0.001
## dinner        |  -1.250   5.346   0.118  -5.928 |   1.395  10.318   0.147
## Not.dinner    |   0.094   0.402   0.118   5.928 |  -0.105   0.777   0.147
##                v.test     Dim.3     ctr    cos2  v.test  
## breakfast     -12.491 |  -0.100   0.406   0.009  -1.668 |
## Not.breakfast  12.491 |   0.093   0.375   0.009   1.668 |
## Not.tea time    6.164 |   0.302   3.351   0.071   4.605 |
## tea time       -6.164 |  -0.234   2.597   0.071  -4.605 |
## evening         7.306 |  -0.445   5.716   0.104  -5.570 |
## Not.evening    -7.306 |   0.233   2.988   0.104   5.570 |
## lunch           0.612 |  -1.226  18.486   0.258  -8.788 |
## Not.lunch      -0.612 |   0.211   3.177   0.258   8.788 |
## dinner          6.619 |   0.095   0.053   0.001   0.450 |
## Not.dinner     -6.619 |  -0.007   0.004   0.001  -0.450 |
## 
## Categorical variables (eta2)
##                 Dim.1 Dim.2 Dim.3  
## breakfast     | 0.020 0.522 0.009 |
## tea.time      | 0.273 0.127 0.071 |
## evening       | 0.126 0.179 0.104 |
## lunch         | 0.127 0.001 0.258 |
## dinner        | 0.118 0.147 0.001 |
## always        | 0.094 0.024 0.490 |
## home          | 0.001 0.154 0.010 |
## work          | 0.181 0.025 0.082 |
## tearoom       | 0.322 0.000 0.012 |
## friends       | 0.311 0.068 0.003 |

According to the variables shown in the summary, dimensions 1 and 2 are separated the most by drinking tea at dinner time. Drinking tea at dinner loads heavily on dimension 2, as does not drinking tea at breakfast. Plotting shows that whether or not tea is drunk at home is also important for this dimension. Dimension 1 is more cluttered: drinking tea in tearooms and with friends loads more onto this component according to the eta2 values. The third dimension seems to be related to the "always" variable, i.e. drinking tea all the time.
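The reading above can be checked against the category contributions that FactoMineR stores in the model object. A sketch assuming the mca model fitted above (and that the contribution matrix uses FactoMineR's default "Dim 1"/"Dim 2" column names):

```r
# top five categories contributing to each of the first two MCA dimensions
head(sort(mca$var$contrib[, "Dim 1"], decreasing = TRUE), 5)
head(sort(mca$var$contrib[, "Dim 2"], decreasing = TRUE), 5)
```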

plot(mca, invisible=c("ind"), habillage = "quali")

Inspired by Ville Harjunen (https://villehar.github.io/IODS-project/), I also wanted to try the plot that he used in his project. I removed the arrows from the plot for clarity. Here the individuals are also shown. It seems that there are people who prefer drinking tea only at dinner, and seldom at home. The dimension may also represent people who seldom drink tea, in contrast to those who drink it everywhere and all the time, culminating in people who actually visit specialized tea houses.

library("devtools")
library("factoextra")

fviz_mca_biplot(mca, axes = c(1, 2), geom = c("point", "text"),
  label = "all")